Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRM-20565 - Improve ajax dedupe lookups on contact add form #10341

Merged
merged 3 commits into from Jun 5, 2018

Conversation

colemanw
Copy link
Member

@colemanw colemanw commented May 12, 2017

@eileenmcnaughton
Copy link
Contributor

I'm going to try to take a look at this next week

@colemanw
Copy link
Member Author

Great, thanks @eileenmcnaughton .

$this->addRadio('contact_ajax_check_similar', ts('Check for Similar Contacts'), array(
'1' => ts('While Typing'),
'0' => ts('When Saving'),
'2' => ts('Never'),
Copy link
Member Author

@colemanw colemanw May 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: In all existing installs this preference is already set to either '1' or '0'. These new options keep the original meaning of 1 and 0 so no upgrade script is needed.

@colemanw
Copy link
Member Author

colemanw commented Jun 3, 2017

Thoughts on this one @eileenmcnaughton ?

@eileenmcnaughton
Copy link
Contributor

Sorry - I still have this on my radar - I'm expecting to do significant work on dedupes going into next quarter - with extracting to an extension & making all interaction with the dedupeFinder & merger via api being high priorities

@colemanw
Copy link
Member Author

@eileenmcnaughton I've rebased this. Can you take a look?

@eileenmcnaughton
Copy link
Contributor

I've started looking at this & in general I agree with the change. I'm still working through it so these are not final comments.

  1. I note with the api call we need to pass 'check_permissions' to the contact.get call - perhaps it will filter out some info, or not, but it's the right thing to do.
  2. I'm not a fan of how that function does different things depending on return. I'm kinda inclined to deprecate it rather than hack it. Then we would wind up with 2 functions eventually

Contact.getmatches
Contact.getduplicates

The former is this function - matches for the data we have. The second would be pairs of duplicates to potentially merge (like the merge screen uses but less nasty). (The second obviously does not exist as yet.).

  1. I want to look a bit more closely at a tidy up of the admin form for this.
  2. From a user point of view we have the situation where if I have the default settings I no longer get presented with duplicates until I have entered first name, last name AND email. This is a sort of regression from getting them as soon as I've entered last name.

In general the defaults for Supervised & Unsupervised seem to be the reverse of what they should be. ie. Unsupervised is happy to given anyone the bash if there is a matching email whereas Supervised is conservative about putting up options. I'm not quite sure how to deal with this but as a user I feel like I would want matches as soon as I've entered first name in & then the top matches should get increasingly specific as I provide more data.

@lcdservices I think you understand the thinking behind Supervised vs Unsupervised - which I'm grappling with.

@colemanw
Copy link
Member Author

  1. You're right. I've added it.

  2. Actually it returns the same format. If you leave off return then it sends back:

    [
    12 => ['id' => 12],
    34 => ['id' => 34],
    ]

and if you add e.g. return => ['display_name', 'email']` then it sends back:

[
  12 => ['id' => 12, 'display_name' => 'Joe Tester', 'email' => 'joe@test.er'],
  34 => ['id' => 34, 'display_name' => 'Jim Tester', 'email' => 'jim@test.er'],
]
  1. Ok. Note my inline comment in the PR about that.
  2. Well the code in this PR doesn't do that, it starts looking for matches as soon as you enter something in any field. But the supervised dedupe rule might get in your way depending on how you've set it up. A more lax rule will indeed give more results. What would be really nice would be to sort those matches by weight (e.g. closer matches first) but our BAO doesn't currently do that I don't think :(

@eileenmcnaughton
Copy link
Contributor

Regarding 4 - I'm just working off what people have as the default out-of-the-box settings. I realise people can customise it but most don't. The experience would be that they are used to having it resolve when they enter last name & now it won't - so it would feel like a bug/regression to a lot of people.

@eileenmcnaughton
Copy link
Contributor

(your reply to 2 makes sense & I agree that is fine)

@colemanw
Copy link
Member Author

I just tested and you're right about number 4. It's not very useful that it only works when all 3 fields are an exact match. I honestly don't know how that qualifies as a "supervised" rule as anyone with exactly the same first&last name and email is unquestionably the same person.

That said, what we had in place prior to this PR wasn't exactly gold-standard either. It matched on last name and nothing else, so Jan Smith would match Bob Smith, etc. - also not very useful results, and not at all configurable as it is now.

@eileenmcnaughton
Copy link
Contributor

My instinct is that the Supervised & Unsupervised rules are messed up & were implemented in reverse of how they were conceived - the reason I was hoping to get input from @lcdservices

I feel like this WOULD feel like a regression to some users as is & I feel like we should consider adding a third rule type that exists by default called 'InputAssistance' or something & making it match on last name or email being sufficient.

Overall I feel like this might not be universally better & we should probably solicit wider input. Possibly from the dev list although it's more a UI question than a dev question but our only other list seems to be 'partners' which is definitely not right and please not a double post....

From a code point of view I'm happy with the code apart from wanting to dig a bit further into the admin form side of things and agree it is an improvement. I just realised I haven't tested that we still do awful ghastly things to people when they choose ' on save ' for the setting

@lcdservices
Copy link
Contributor

lcdservices commented Oct 20, 2017

@eileenmcnaughton sorry -- missed the earlier ping on this. I'm happy to give my 2 cents on the rules... ;-)

  1. I've never liked that the unsupervised rule was so loose by default (email only). I argued for stricter default rules many versions ago but Dave G felt that since email is the only field required on contrib forms and event reg forms, and the inclusion of profiles is optional, we shouldn't create a default dedupe rule that may be invalidated by the default configuration of contrib pages/events. While I appreciate that argument, I would argue we should opt for a more strict rule regardless, as the unsupervised rule should always err on the side of being more restrictive (to prevent false positives). Further -- since that discussion several years ago we've added a validation rule so that if the user selects a dedupe rule where the fields are not present in the form via the profile, we throw a warning that the configuration is not capable of creating a match.
  2. In principle, I agree with @eileenmcnaughton that the two rules are better switched. Unsupervised should match on more values, Supervised on less.
  3. I also agree that the supervised rule would benefit from being more nuanced -- which I gather is the impetus for this PR. personally I think a match on first name + last name is a sufficient default rule.
  4. the two default "optimized" rules would benefit from some review. those were created in conjunction with the creation of the dedupe hook, (v4.2?). we had that hook added for NYSS and proposed the optimized rules at that time (that's when the discussion about defaults took place). but I feel like they might benefit from a code review. (I realize that's not the purview of this issue)

@lcdservices
Copy link
Contributor

One more comment after reading the JIra issue --
I agree that the lookup should be done via ajax or postprocess validate, but not both -- and I'd vote for the AJAX method. in my thinking, the ajax method should derive from the supervised rule, and should only be triggered when all fields in that rule actually have data in them. so if the rule is defined as first + last name, it should not be triggered after the first name is added, but only after both first and last have been added.

@eileenmcnaughton
Copy link
Contributor

So to attempt to pull things together - I think both @lcdservices & I would be happy to see only TWO options

  1. ajax checks
  2. no check

I think you'd struggle to find anyone who likes the on-save validation - but we could ask.

  1. We don't think the default rules are right (which would affect usablity of this) but the question is - how do you deal with that. The best thing I can think of is to add a new type that is 'fit-for-purpose' by default. I feel like last name OR email match would be a good default for a data-entry rule (although there might be performance issues with that default on larger sites).

@colemanw
Copy link
Member Author

colemanw commented Oct 23, 2017

I agree about a made-for-deduping rule. I actually custom coded one of those for Woolman years ago. It did all kinds of fuzzy OR logic, taking phone numbers, nick names, email, etc into consideration. It wasn't very performant but it was awesome at nailing hard-to-find dupes
https://github.com/woolman/drupal-modules/blob/master/woolman_website/woolman_dupe_query.inc#L54.

@eileenmcnaughton
Copy link
Contributor

I had another go at this and the usability seemed ok if I created a new for-purpose rule & used that instead of the 'Supervised' rule (in practice I did this by editing the Supervised rule to have a threshold of 5 & deleting CRM/Dedupe/BAO/QueryBuilder/IndividualSupervised.php). Using the existing built in rule didn't give me any feedback until all 3 fields matched which was a worse user experience than without the patch.

With my 'new custom rule' the performance was 'ok' even on the wmf database. By that I mean that on many names it resolved in real time but even on 'John Smith' it didn't cause queries to spin off and bring the database down. I did get a very strange UI experience on the slow ones as nothing showed up & then suddenly I had 3 matching boxes of matches.

However, the problem with the code as is is that the names are sorted by Sort name. Since there are more than 20 Eileen's in our database the list of Eileen's displayed once I entered first_name did not include me. Once I entered my email it fired again - but presented the list of names in the same order so I was still only to see the first 20 possible matches which include me.

I tried to make it sort by weight so the best matches would float to the top - the code is my efforts, but while I managed to get the api to sort by weight the UI still did not.

Doing data entry the whole thing would work better if the email address field were the first one you hit because then it could try to match on that first & also I think it's realistically more important for dataentry than fields like Job Title & Contact Type - tangental...

diff --git a/CRM/Dedupe/BAO/RuleGroup.php b/CRM/Dedupe/BAO/RuleGroup.php
index 7557af3..f374863 100644
--- a/CRM/Dedupe/BAO/RuleGroup.php
+++ b/CRM/Dedupe/BAO/RuleGroup.php
@@ -211,8 +211,7 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
     $exclWeightSum = array();
 
     // create temp table
-    $dao = new CRM_Core_DAO();
-    $dao->query($tempTableQuery);
+    $dao = CRM_Core_DAO::executeQuery($tempTableQuery);
 
     CRM_Utils_Hook::dupeQuery($this, 'table', $tableQueries);
 
Discard this hunk from worktree [y,n,q,a,d,/,j,J,g,e,?]? q

wmf1411:civicrm emcnaughton$ git diff
diff --git a/CRM/Dedupe/BAO/RuleGroup.php b/CRM/Dedupe/BAO/RuleGroup.php
index 7557af3..f374863 100644
--- a/CRM/Dedupe/BAO/RuleGroup.php
+++ b/CRM/Dedupe/BAO/RuleGroup.php
@@ -211,8 +211,7 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
     $exclWeightSum = array();
 
     // create temp table
-    $dao = new CRM_Core_DAO();
-    $dao->query($tempTableQuery);
+    $dao = CRM_Core_DAO::executeQuery($tempTableQuery);
 
     CRM_Utils_Hook::dupeQuery($this, 'table', $tableQueries);
 
@@ -384,7 +383,8 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
       $query = "SELECT dedupe.id1 as id
                 FROM dedupe JOIN civicrm_contact ON dedupe.id1 = civicrm_contact.id {$this->_aclFrom}
                 WHERE contact_type = '{$this->contact_type}' {$this->_aclWhere}
-                AND weight >= {$this->threshold}";
+                AND weight >= {$this->threshold}
+                 ORDER BY weight DESC ";
     }
     else {
       $this->_aclWhere = ' AND c1.is_deleted = 0 AND c2.is_deleted = 0';
@@ -399,7 +399,8 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
                        LEFT JOIN civicrm_dedupe_exception exc ON dedupe.id1 = exc.contact_id1 AND dedupe.id2 = exc.contact_id2
                 WHERE c1.contact_type = '{$this->contact_type}' AND
                       c2.contact_type = '{$this->contact_type}' {$this->_aclWhere}
-                      AND weight >= {$this->threshold} AND exc.contact_id1 IS NULL";
+                      AND weight >= {$this->threshold} AND exc.contact_id1 IS NULL
+                      ORDER BY weight DESC ";
     }
 
     CRM_Utils_Hook::dupeQuery($this, 'threshold', $query);
diff --git a/CRM/Dedupe/Finder.php b/CRM/Dedupe/Finder.php
index fd5d1c9..108fec4 100644
--- a/CRM/Dedupe/Finder.php
+++ b/CRM/Dedupe/Finder.php
@@ -148,8 +148,8 @@ class CRM_Dedupe_Finder {
 
     $rgBao->params = $params;
     $rgBao->fillTable();
-    $dao = new CRM_Core_DAO();
-    $dao->query($rgBao->thresholdQuery($params['check_permission']));
+    $dao = CRM_Core_DAO::executeQuery($rgBao->thresholdQuery($params['check_permission']));
+
     $dupes = array();
     while ($dao->fetch()) {
       if (isset($dao->id) && $dao->id) {
diff --git a/api/v3/Contact.php b/api/v3/Contact.php
index 38a1172..86846e0 100644
--- a/api/v3/Contact.php
+++ b/api/v3/Contact.php
@@ -1371,13 +1371,15 @@ function civicrm_api3_contact_duplicatecheck($params) {
   );
   $values = array();
   if ($dupes && !empty($params['return'])) {
-    return civicrm_api3('Contact', 'get', array(
+    $result = civicrm_api3('Contact', 'get', array(
       'return' => $params['return'],
       'id' => array('IN' => $dupes),
       'options' => CRM_Utils_Array::value('options', $params),
-      'sequential' => CRM_Utils_Array::value('sequential', $params),
       'check_permissions' => CRM_Utils_Array::value('check_permissions', $params),
     ));
+    
+    $result['values'] = array_replace(array_flip($dupes), $result['values']);
+    return $result;
   }
   foreach ($dupes as $dupe) {
     $values[$dupe] = array('id' => $dupe);
diff --git a/templates/CRM/Contact/Form/Contact.tpl b/templates/CRM/Contact/Form/Contact.tpl
index 384fe48..cdc4b9a 100644
--- a/templates/CRM/Contact/Form/Contact.tpl
+++ b/templates/CRM/Contact/Form/Contact.tpl
@@ -300,7 +300,6 @@
       CRM.api3('contact', 'duplicatecheck', {
         match: match,
         rule_type: 'Supervised',
-        options: {sort: 'sort_name'},
         return: ['display_name', 'email']
       }).done(function(data) {
         var title = data.count == 1 ? {/literal}"{ts escape='js'}Similar Contact Found{/ts}" : "{ts escape='js'}Similar Contacts Found{/ts}"{literal},

@eileenmcnaughton
Copy link
Contributor

@colemanw what should we do with this? I think it seems like to get it mergeable we would need to

  1. create a new inbuilt rule ('date entry'?) that mimicked existing behaviour - people could edit from there.
  2. figure out how to order by weight per the last comment

I think it's a nice improvement - but it also seems not quite mergeable at the moment & I don't think either of us are quite committed enough to get it over the line at the moment? If that is the case we should close the PR & track it from gitlab (we won't lose access to the changes - although if you ALSO delete the branch & we haven't linked into the commit specifically we will struggle to find it again

@colemanw
Copy link
Member Author

colemanw commented May 15, 2018

The trouble with adding a new rule is that we bump against the limitations of the rule categories. Currently we have "Supervised" "Unsupervised" and "General". We could create a new category but that's yet another can-o-worms to open.

IMO we ought to create a new hard-coded rule called "Name or Email or Phone or Address" and make it the default Supervised rule. The current default Supervised rule isn't very good. For existing sites, if they have the current default "Name and Email" default Supervised rule, we switch it. If they've configured a custom Supervised rule, we leave it alone.

That could be done in a separate PR. Just to think out loud, I think the ideal fuzzy rule would do something like

IF (
  (first_name && last_name match) OR
  (nick_name && last_name match) OR
  (email matches) OR
  (phone matches) OR
  (street_address matches)
)

@eileenmcnaughton you know more about optimizing queries on huge databases than I do. Is the above too crazy for large databases to handle?

@eileenmcnaughton
Copy link
Contributor

OK - so to answer that I think I need to feel more comfortable with how the existing Supervised rule is used - ie. audit the places. I realise my discomfort with the idea is about the places it is used. TODO audit this.

My inclination is that adding a DataEntry category is safer because it involves changing behaviour in less places. Where it differs logically from the 'normal' concept of 'Supervised' is that we would want it to be more aggressive about suggestions - ie. as you type 'Coleman' suggestions for 'Coleman' would appear (for a large DB they might tweak that but on a small DB that's a good rule).

Regarding whether it should be a hardcoded rule - I can do some tests - I'm not sure that the hard-coded rules are more efficient from my previous tests in this instance.

@eileenmcnaughton
Copy link
Contributor

@colemanw I had an idea here.

On one side of this patch we have a significant UI improvement on the other we have a change in behaviour to the selection of which contacts are duplicates that makes me uncomfortable. Could we just get the first part commited - ie. write a (deprecated) api or ajax call that does the same look up as right now & use that & then review (if we choose) switching the mechanism?

@colemanw
Copy link
Member Author

@eileenmcnaughton well the current implementation of the ajax lookup leaves a lot to be desired. It matches on last_name and nothing else.
I suppose now that the api3 supports the OR operator I could fire off a Contact.get api call to match on first name or last name or email, without going through the dedupe engine; at least that would be an improvement.

@eileenmcnaughton
Copy link
Contributor

@colemanw Yep & we could get that merged & then consider improving the rules.

@colemanw
Copy link
Member Author

@eileenmcnaughton done. I also added some throttling to prevent multiple messages popping up if the user fills out the form quickly.

@eileenmcnaughton
Copy link
Contributor

test this please

@eileenmcnaughton
Copy link
Contributor

I just tested this & it feels better from a UI POV & I didn't spot any issues - I think it's good to merge

@monishdeb
Copy link
Member

Merging as per tag.

@monishdeb monishdeb merged commit c47a8ed into civicrm:master Jun 5, 2018
@monishdeb
Copy link
Member

@colemanw @eileenmcnaughton please close the JIRA ticket. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants